SelfAskRefusalScorer honors partial_content on blocked pieces #2083
tejas0077 wants to merge 2 commits into
Hi! In my head, if the model or application filtered out or blocked a response (even if it was already being generated), this should be considered a refusal, which is why the default for score_blocked_content is False. If a user did want to score partial content, they could set scorer.score_blocked_content = True to enable that behavior. Let me know if that makes sense or if you think otherwise :)
That's fair. In that case, this PR is actively doing the opposite of what I'd expect the default to be and we can close it. Feel free to comment @tejas0077 if you have other thoughts.
Fixes #2044 (sub-issue #2)
SelfAskRefusalScorer unconditionally returned refusal=True when response_error == "blocked", even when partial_content was available in prompt_metadata. This silently dropped potentially successful jailbreaks from red-team results — the most evasive successes were exactly the ones being missed.
The fix sets score_blocked_content = True on SelfAskRefusalScorer so the base Scorer class handles partial content substitution via the existing _apply_blocked_content_substitution mechanism. When a blocked piece has partial_content, it is now scored via the LLM instead of being unconditionally treated as a clean refusal.
The rationale string for blocked responses with no partial content has also been updated to be more descriptive.
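The decision described above can be sketched roughly as follows. This is a simplified illustration, not the PyRIT implementation: `Piece`, `RefusalResult`, `score_refusal`, and the `llm_judge` callable are hypothetical stand-ins, and the rationale strings are placeholders. Only the `response_error == "blocked"` check, the `partial_content` key in `prompt_metadata`, and the `score_blocked_content` flag come from the PR discussion.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Piece:
    """Hypothetical stand-in for a response piece (not the PyRIT class)."""
    converted_value: str
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


@dataclass
class RefusalResult:
    """Hypothetical result type: refusal flag plus a rationale string."""
    refusal: bool
    rationale: str


def score_refusal(
    piece: Piece,
    llm_judge: Callable[[str], bool],
    score_blocked_content: bool = False,
) -> RefusalResult:
    """Return a refusal verdict; llm_judge(text) is assumed to return True for a refusal."""
    if piece.response_error == "blocked":
        partial = piece.prompt_metadata.get("partial_content")
        # Without the opt-in flag, or without partial content, a blocked
        # piece is treated as a refusal and the judge is never called.
        if not (score_blocked_content and partial):
            return RefusalResult(True, "Response was blocked; nothing was scored.")
        text = partial  # Substitute the partial content for scoring.
    else:
        text = piece.converted_value
    return RefusalResult(llm_judge(text), "Scored by the judge.")
```

With the flag left at `False`, a blocked piece is always reported as a refusal, which matches the default the reviewers describe; setting it to `True` sends any available partial content to the judge instead.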
Tests and Documentation
Updated the existing test_score_async_filtered_response test to match the new rationale string and added a new test test_score_async_blocked_with_partial_content_scores_partial that verifies blocked pieces with partial content are forwarded to the LLM scorer instead of immediately returning refusal=True.
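For illustration, the shape of such a test might look like the sketch below. It does not use the real PyRIT classes or fixtures: `SketchRefusalScorer` and its `score_async` signature are hypothetical stand-ins, and the judge is replaced by an `AsyncMock` so the checks can confirm whether the partial content was forwarded.

```python
import asyncio
from unittest.mock import AsyncMock


class SketchRefusalScorer:
    """Hypothetical stand-in, not the PyRIT SelfAskRefusalScorer."""

    def __init__(self, judge, score_blocked_content=False):
        self._judge = judge  # async callable: text -> bool (True means refusal)
        self.score_blocked_content = score_blocked_content

    async def score_async(self, response_error, metadata, text=""):
        if response_error == "blocked":
            partial = metadata.get("partial_content")
            if not (self.score_blocked_content and partial):
                return True  # Treated as a refusal without calling the judge.
            text = partial
        return await self._judge(text)


async def check_blocked_with_partial_content_scores_partial():
    judge = AsyncMock(return_value=False)
    scorer = SketchRefusalScorer(judge, score_blocked_content=True)
    refusal = await scorer.score_async("blocked", {"partial_content": "partial text"})
    # The partial content, not an empty string, must reach the judge.
    judge.assert_awaited_once_with("partial text")
    return refusal


async def check_blocked_without_partial_content_is_refusal():
    judge = AsyncMock(return_value=False)
    scorer = SketchRefusalScorer(judge, score_blocked_content=True)
    refusal = await scorer.score_async("blocked", {})
    judge.assert_not_awaited()
    return refusal


if __name__ == "__main__":
    assert asyncio.run(check_blocked_with_partial_content_scores_partial()) is False
    assert asyncio.run(check_blocked_without_partial_content_is_refusal()) is True
```

Mocking the judge keeps the check focused on routing: the first case verifies the judge is awaited with the partial content, the second that it is never awaited when there is nothing to score.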